knitr::opts_chunk$set(cache = TRUE)
This is working from the word doc and html file about ANOVA
work through the ANOVA html document
Reading in the rarefied OTU table
# going to where the file is
setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/analysis/taxasummaryplots_ANOVA/")
# using read.delim becaue I don't know the separator
otu <- read.table(file = "otu_table_d8000.txt", # the file to read in
sep = "\t", # setting the field separator as a tab
skip = 1, # skipping the first comment line
comment.char = "", # the header of the file starts with a #
as.is = T, # read in the text as character strings
check.names = FALSE,
header = TRUE) # there are colnames in the file, so use them
Don’t use file.choose()
What kind of cols did we get?
names(otu)
## [1] "#OTU ID" "700033789" "700110415" "700114832" "700099121"
## [6] "700108086" "700110349" "700102581" "700037714" "700110808"
## [11] "700038795" "700016099" "700102647" "700024001" "700106052"
## [16] "700038736" "700114649" "700024701" "700110748" "700098589"
## [21] "700107806" "700100647" "700016059" "700038938" "700099617"
## [26] "700037732" "700106916" "700038998" "700103587" "700105705"
## [31] "700016333" "700114854" "700114747" "700110689" "700016374"
## [36] "700106617" "700037702" "700014994" "700111841" "700103623"
## [41] "700102924" "700114444" "700023477" "700110746" "700095429"
## [46] "700038978" "700103517" "700099408" "700109551" "700101269"
## [51] "700097718" "700114767" "700106089" "700106563" "700024149"
## [56] "700097650" "700023313" "700107864" "700106637" "700114872"
## [61] "700097589" "700103653" "700114818" "700097313" "700014956"
## [66] "700013596" "700095235" "700101366" "700103710" "700023267"
## [71] "700038950" "700103303" "700111775" "700101335" "700016405"
## [76] "700108266" "700024097" "700095275" "taxonomy"
For future work, know that the first column are the OTU ids, and the last column are taxonomy of the OTU
Finding the number of duplicated taxons. The duplicated function returns a logical vector testing to see if a certain entry is duplicated, a TRUE if it is, FALSE if it is not.
head(duplicated(otu$table)) # take a look at the beginning of the vector
## logical(0)
The cool thing with a vector of logicals, is that you can sum them, and TRUE is counted as 1, and FALSE is 0 – think binary. So we can sum that vector to see how many are duplicate
dups <- sum(duplicated(otu$taxonomy))
Note: the numbers that I am going to get will be different than yours – you should know why. If not, think about the otu table I am reading in, and the one that you are reading in.
Now, how many unique taxa?
uniq <- length(unique(otu$taxonomy))
dups and uniq should sum to be the same as the length of otu$taxonomy …
dups + uniq == length(otu$taxonomy)
## [1] TRUE
Nice.
We’re going to use the aggregate function to aggregate the count data based on taxonomy of the OTU. For this we need to leave out the OTU ID, simply subset the dataframe with otu[, -1] … then also the taxonomy column, which is the last column. I’m going to use dim to find the last column by getting the second value that dim returns (dim returns the number of rows, then cols). Using the - notation to deselect rows, I need to make a vector of the two cols I don’t want. so it’s going to look like otu[ , -c(1, dim(otu)[2])]. My final aggregate statement is below.
otu.m <- aggregate( otu[, -c(1, dim(otu)[2])],
by = list(otu$taxonomy),
FUN=sum)
dim(otu.m)
## [1] 578 78
Nice.
Looking at how it changed things:
names(otu.m)
## [1] "Group.1" "700033789" "700110415" "700114832" "700099121"
## [6] "700108086" "700110349" "700102581" "700037714" "700110808"
## [11] "700038795" "700016099" "700102647" "700024001" "700106052"
## [16] "700038736" "700114649" "700024701" "700110748" "700098589"
## [21] "700107806" "700100647" "700016059" "700038938" "700099617"
## [26] "700037732" "700106916" "700038998" "700103587" "700105705"
## [31] "700016333" "700114854" "700114747" "700110689" "700016374"
## [36] "700106617" "700037702" "700014994" "700111841" "700103623"
## [41] "700102924" "700114444" "700023477" "700110746" "700095429"
## [46] "700038978" "700103517" "700099408" "700109551" "700101269"
## [51] "700097718" "700114767" "700106089" "700106563" "700024149"
## [56] "700097650" "700023313" "700107864" "700106637" "700114872"
## [61] "700097589" "700103653" "700114818" "700097313" "700014956"
## [66] "700013596" "700095235" "700101366" "700103710" "700023267"
## [71] "700038950" "700103303" "700111775" "700101335" "700016405"
## [76] "700108266" "700024097" "700095275"
So … what changed??
# restoring correctness to the world
names(otu.m)[1] <- "taxonomy"
number of reads equal for all samples?
sum(colSums(otu.m[,-1]) == 8000) == dim(otu.m)[2] -1
## [1] TRUE
made an equality statement, asking if there are the same number of TRUEs as there are sample columns.
So I am going to be making a function to turn the otu table from counts to proportions.
ConvertToProportionOTU <- function(count.otu) {
# copying the count.otu table into prop.otu table so that I can store
# the values into it later.
prop.otu <- count.otu
for(i in 2:ncol(otu.m)){
prop.otu[,i] <- 100*count.otu[,i]/sum(count.otu[,i])
}
return(prop.otu)
}
otu.prop.m <- ConvertToProportionOTU(count.otu = otu.m)
So, now checking the colsums of the pro
head(colSums(otu.prop.m[ ,-1]))
## 700033789 700110415 700114832 700099121 700108086 700110349
## 100 100 100 100 100 100
We are going to filter out the taxa that occurr in fewer than 4 individuals.
Breaking down what the rowSums(otu.m[,-1] > 0) statement does:
otu.m[,-1] > 0 asks which elements in the dataframe are greater than 0. It creates a logical matrix (all TRUE or FALSE)rowSums sums each row, remember TRUE is a 1, and FALSE is a 0then we go and see which rows have more than 4 samples that contain that OTU
# define a cutoff
cutoff <- rowSums(otu.m[,-1] > 0) > 4
# what is that doing?
cutoff
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [45] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [56] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [67] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [111] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [133] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## [155] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [166] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [199] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [221] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [232] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE
## [243] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [254] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [276] TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [287] TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## [298] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [309] TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [320] TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [353] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [364] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [375] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [408] FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [419] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [430] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [441] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [452] FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [463] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [474] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [485] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [496] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [507] TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## [518] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## [529] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [540] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## [551] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [562] FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
## [573] FALSE TRUE TRUE TRUE FALSE TRUE
sum(cutoff==T) # how many ones are we going to keep?
## [1] 251
# filter your OTU table based on that cutoff
otu.prop.filtered <- otu.prop.m[cutoff,]
# what is this doing?
length(otu.prop.filtered[,1])==sum(cutoff==T)
## [1] TRUE
Loading in some more packages
library(car)
# apparently the lme4 package is not needed?
Making the function that they gave us
boxCoxNormalize <- function(x){
if(length(x)<=1){
x;
}
else{
if(sum(x<=0)>0){
x=x-min(x)+0.0001;
}
p=powerTransform(x)[6]$start;
xnorm=(x^p -1)/p
xnorm
}
}
Doing what they gave us … changing the names because my dataframes are special snowflakes
# Define the first column of otu.m as the variable 'taxonomy'
taxonomy = otu.prop.filtered[,1]
# boxCoxNormalize() requires a matrix with only numeric data
otu.n <- as.matrix(otu.prop.filtered[,2:ncol(otu.prop.filtered)])
# apply boxCoxNormalize() using a for loop to the new matrix
for (i in 1:nrow(otu.n)){
otu.n[i,] = boxCoxNormalize(otu.n[i,])
}
# add 'taxonomy' back as column #1 to otu.n. Why is the as.data.frame function important here?
otu.prop.normalized <- cbind(taxonomy, as.data.frame(otu.n))
It’s a party
metadata <- read.table('HMP_5BS_metadata.txt', header=T, sep='\t', check.names=F)
what variable names do we have?
names(metadata)
## [1] "SampleID" "BarcodeSequence" "LinkerPrimerSequence"
## [4] "Sex" "BodySite" "SRS_SampleID"
## [7] "FASTA_FILE" "Description"
matching the metadata to the samples that we have:
# finding the samples in the metadata that match the samples we have
metadata <- metadata[as.character(metadata$SampleID) %in%
colnames(otu.prop.normalized[,2:ncol(otu.prop.normalized)]),]
# finding the order that the rows match the colums
mm <- match(colnames(otu.prop.normalized[, -1]),
as.character(metadata$SampleID))
# reordering the metadata based on matching rows
metadata <- metadata[mm,]
# checking to see if the names are the same
as.character(metadata$SampleID) ==
colnames(otu.prop.normalized[,2:ncol(otu.prop.normalized)])
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sex <- metadata$Sex
sex <- factor(sex,c("male","female"))
bs <- metadata$BodySite
bs <- factor(bs,c("Subgingival_plaque","Saliva","Left_Retroauricular_crease","Stool","Mid_vagina"))
ba <- metadata$Description
ba <- factor(ba,c("Urogenital_tract","Oral","Gastrointestinal_tract","Skin"))
pvals.sex <- vector()
pvals.bs <- vector()
pvals.ba <- vector()
# assign otu.m[1,] to variable taxon1
y <- as.numeric(otu.prop.normalized[8,2:ncol(otu.prop.normalized)])
# do an ANOVA on otu.m[,1] and sex, to test for significnat differences in distributions of the taxon in row #1 between males and females
aovOut <- aov(y ~ sex)
# view the output of the ANOVA using summary(aovOut)
summary(aovOut)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 5236234 5236234 2.606 0.111
## Residuals 75 150691463 2009220
Doing all this stuff with for loops
# so I renamed my out table to otu.prop.normalized
# since the rest of the code is with otu.m, I'm just going to rename my dataframe
# as otu.m
otu.m <- otu.prop.normalized
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ sex)
pvals.sex[i] <- summary(aovOut)[[1]][1,5]
}
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ bs)
pvals.bs[i] <- summary(aovOut)[[1]][1,5]
}
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ ba)
pvals.ba[i] <- summary(aovOut)[[1]][1,5]
}
pvals.sex = p.adjust(pvals.sex, "fdr")
pvals.bs = p.adjust(pvals.bs, "fdr")
pvals.ba = p.adjust(pvals.ba, "fdr")
How many p-vals are significant
sum(pvals.sex < 0.05)
## [1] 4
sum(pvals.bs < 0.05)
## [1] 225
sum(pvals.ba < 0.05)
## [1] 224
Which OTUs are significant?
otu.m[which(pvals.sex < 0.05),1]
otu.m[which(pvals.bs < 0.05),1]
otu.m[which(pvals.ba < 0.05),1]
… this dumps a ton of output
par(mar=c(2,2,2,2),mgp=c(0.5,1.2,0))
par(xpd=T)
par(mfrow=c(1,1))
for(i in which(pvals.bs < 0.05)){
y = as.numeric(otu.m[i,2:ncol(otu.m)])
name1 = gsub('; .__',' ', as.character(otu.m[i,1]))
name2 = gsub('k__','', name1)
name3 = gsub(' ','', name2)
name = strsplit(name3, " ", fixed=T)[[1]]
names_tail = tail(name, n=2)
boxplot(y ~ bs, main=names_tail, ylab="Relative abundance", col=c("tomato", "gray", "gold","tan4","dodgerblue"))}
This picks up in the ANOVA html document where the otu_tax_lev() is mentioned
Like in the Taxa summary assignment, I saved the otu_tax_lev() function to a different R script, and now I can source from that, keeping my analysis script clean. Nice.
Also I need to load plyr since the function uses it. If you get an error loading this, make sure it’s installed (install.packages("plyr"))
# going to where the script file is
setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/scripts/")
source("otu_tax_lev.R") # note, plyr gets loaded during the source of this
Now I need to go to the dir where my data files are, since the file needs to be able to load the files (which isn’t a good practice)
setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/analysis/taxasummaryplots_ANOVA/")
otu_TL.phylum <- otu_tax_lev('otu_table_d8000.txt', TL='phylum')
# what does the output look like?
str(otu_TL.phylum)
## List of 2
## $ otu_table: num [1:20, 1:77] 7236 150 337 17 259 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "k__Bacteria; p__Firmicutes" "k__Bacteria; p__Actinobacteria" "k__Bacteria; p__Proteobacteria" "k__Bacteria; p__Bacteroidetes" ...
## .. ..$ : chr [1:77] "700033789" "700110415" "700114832" "700099121" ...
## $ otuNames : chr [1:20] "k__Bacteria; p__Firmicutes" "k__Bacteria; p__Actinobacteria" "k__Bacteria; p__Proteobacteria" "k__Bacteria; p__Bacteroidetes" ...
head(otu_TL.phylum$otu_table)
## 700033789 700110415 700114832 700099121
## k__Bacteria; p__Firmicutes 7236 7799 7835 6250
## k__Bacteria; p__Actinobacteria 150 102 115 909
## k__Bacteria; p__Proteobacteria 337 20 17 181
## k__Bacteria; p__Bacteroidetes 17 77 17 537
## k__Bacteria; p__Tenericutes 259 2 9 116
## k__Bacteria; p__Fusobacteria 1 0 7 7
## 700108086 700110349 700102581 700037714
## k__Bacteria; p__Firmicutes 3543 7970 7988 7337
## k__Bacteria; p__Actinobacteria 4188 13 6 46
## k__Bacteria; p__Proteobacteria 224 5 4 394
## k__Bacteria; p__Bacteroidetes 16 10 2 110
## k__Bacteria; p__Tenericutes 0 0 0 92
## k__Bacteria; p__Fusobacteria 6 1 0 0
## 700110808 700038795 700016099 700102647
## k__Bacteria; p__Firmicutes 7474 7760 6444 7732
## k__Bacteria; p__Actinobacteria 216 7 139 19
## k__Bacteria; p__Proteobacteria 14 15 80 6
## k__Bacteria; p__Bacteroidetes 270 201 891 124
## k__Bacteria; p__Tenericutes 22 15 33 119
## k__Bacteria; p__Fusobacteria 0 2 352 0
## 700024001 700106052 700038736 700114649
## k__Bacteria; p__Firmicutes 6605 7817 7979 7531
## k__Bacteria; p__Actinobacteria 92 63 9 298
## k__Bacteria; p__Proteobacteria 1098 81 9 32
## k__Bacteria; p__Bacteroidetes 128 12 0 138
## k__Bacteria; p__Tenericutes 0 0 1 1
## k__Bacteria; p__Fusobacteria 0 1 2 0
## 700024701 700110748 700098589 700107806
## k__Bacteria; p__Firmicutes 2446 76 3461 1218
## k__Bacteria; p__Actinobacteria 5467 7733 3731 6740
## k__Bacteria; p__Proteobacteria 33 58 583 33
## k__Bacteria; p__Bacteroidetes 14 83 190 5
## k__Bacteria; p__Tenericutes 1 0 8 0
## k__Bacteria; p__Fusobacteria 3 38 20 2
## 700100647 700016059 700038938 700099617
## k__Bacteria; p__Firmicutes 570 3315 1638 3352
## k__Bacteria; p__Actinobacteria 7355 418 2415 138
## k__Bacteria; p__Proteobacteria 47 833 312 2877
## k__Bacteria; p__Bacteroidetes 17 2782 1754 829
## k__Bacteria; p__Tenericutes 0 7 0 3
## k__Bacteria; p__Fusobacteria 10 489 1711 327
## 700037732 700106916 700038998 700103587
## k__Bacteria; p__Firmicutes 2396 1383 2621 4648
## k__Bacteria; p__Actinobacteria 3890 1339 3848 469
## k__Bacteria; p__Proteobacteria 695 3303 268 869
## k__Bacteria; p__Bacteroidetes 578 1731 237 1272
## k__Bacteria; p__Tenericutes 0 0 0 0
## k__Bacteria; p__Fusobacteria 385 243 264 704
## 700105705 700016333 700114854 700114747
## k__Bacteria; p__Firmicutes 1973 1540 1879 559
## k__Bacteria; p__Actinobacteria 811 2014 2315 2723
## k__Bacteria; p__Proteobacteria 354 319 505 1358
## k__Bacteria; p__Bacteroidetes 2383 1940 1906 2124
## k__Bacteria; p__Tenericutes 16 19 0 15
## k__Bacteria; p__Fusobacteria 2252 1382 1154 1019
## 700110689 700016374 700106617 700037702
## k__Bacteria; p__Firmicutes 1170 1355 3326 4611
## k__Bacteria; p__Actinobacteria 2023 1181 254 3247
## k__Bacteria; p__Proteobacteria 531 782 675 123
## k__Bacteria; p__Bacteroidetes 2353 2441 1940 10
## k__Bacteria; p__Tenericutes 0 23 117 1
## k__Bacteria; p__Fusobacteria 1739 1377 133 1
## 700014994 700111841 700103623 700102924
## k__Bacteria; p__Firmicutes 1125 1926 3045 1218
## k__Bacteria; p__Actinobacteria 423 90 134 1
## k__Bacteria; p__Proteobacteria 559 440 2518 337
## k__Bacteria; p__Bacteroidetes 3075 4463 1619 6442
## k__Bacteria; p__Tenericutes 40 0 1 0
## k__Bacteria; p__Fusobacteria 2203 877 292 0
## 700114444 700023477 700110746 700095429
## k__Bacteria; p__Firmicutes 6454 4579 1791 1848
## k__Bacteria; p__Actinobacteria 1494 8 1343 3
## k__Bacteria; p__Proteobacteria 43 6 1582 165
## k__Bacteria; p__Bacteroidetes 4 3400 2367 5901
## k__Bacteria; p__Tenericutes 0 4 0 0
## k__Bacteria; p__Fusobacteria 0 1 914 1
## 700038978 700103517 700099408 700109551
## k__Bacteria; p__Firmicutes 2619 3657 3864 1910
## k__Bacteria; p__Actinobacteria 11 183 440 1286
## k__Bacteria; p__Proteobacteria 191 1171 2246 795
## k__Bacteria; p__Bacteroidetes 4995 2491 902 2321
## k__Bacteria; p__Tenericutes 121 9 1 0
## k__Bacteria; p__Fusobacteria 1 440 422 1685
## 700101269 700097718 700114767 700106089
## k__Bacteria; p__Firmicutes 2160 3844 1305 7937
## k__Bacteria; p__Actinobacteria 788 265 7 3
## k__Bacteria; p__Proteobacteria 1176 1234 558 47
## k__Bacteria; p__Bacteroidetes 2465 2315 6130 9
## k__Bacteria; p__Tenericutes 0 0 0 0
## k__Bacteria; p__Fusobacteria 1391 331 0 0
## 700106563 700024149 700097650 700023313
## k__Bacteria; p__Firmicutes 4997 1949 3012 5914
## k__Bacteria; p__Actinobacteria 2840 4315 27 29
## k__Bacteria; p__Proteobacteria 136 1616 15 161
## k__Bacteria; p__Bacteroidetes 26 97 3820 1827
## k__Bacteria; p__Tenericutes 0 0 903 24
## k__Bacteria; p__Fusobacteria 1 13 0 1
## 700107864 700106637 700114872 700097589
## k__Bacteria; p__Firmicutes 433 496 1852 2073
## k__Bacteria; p__Actinobacteria 7555 7261 3018 12
## k__Bacteria; p__Proteobacteria 5 242 82 114
## k__Bacteria; p__Bacteroidetes 5 1 2848 5431
## k__Bacteria; p__Tenericutes 0 0 199 2
## k__Bacteria; p__Fusobacteria 2 0 0 1
## 700103653 700114818 700097313 700014956
## k__Bacteria; p__Firmicutes 1644 1765 1953 2135
## k__Bacteria; p__Actinobacteria 3 16 2 27
## k__Bacteria; p__Proteobacteria 1368 878 372 299
## k__Bacteria; p__Bacteroidetes 4985 5341 5641 5537
## k__Bacteria; p__Tenericutes 0 0 0 0
## k__Bacteria; p__Fusobacteria 0 0 2 1
## 700013596 700095235 700101366 700103710
## k__Bacteria; p__Firmicutes 2774 1403 1786 2144
## k__Bacteria; p__Actinobacteria 21 2 23 12
## k__Bacteria; p__Proteobacteria 133 20 551 471
## k__Bacteria; p__Bacteroidetes 4478 6381 5633 5371
## k__Bacteria; p__Tenericutes 4 188 0 0
## k__Bacteria; p__Fusobacteria 0 0 0 0
## 700023267 700038950 700103303 700111775
## k__Bacteria; p__Firmicutes 2768 105 2849 2969
## k__Bacteria; p__Actinobacteria 1 2 2607 188
## k__Bacteria; p__Proteobacteria 202 9 344 1509
## k__Bacteria; p__Bacteroidetes 5016 7884 1342 2194
## k__Bacteria; p__Tenericutes 0 0 804 0
## k__Bacteria; p__Fusobacteria 0 0 0 1110
## 700101335 700016405 700108266 700024097
## k__Bacteria; p__Firmicutes 1633 2177 5489 1492
## k__Bacteria; p__Actinobacteria 1700 4136 720 3730
## k__Bacteria; p__Proteobacteria 366 568 408 2768
## k__Bacteria; p__Bacteroidetes 2572 338 1191 4
## k__Bacteria; p__Tenericutes 0 0 0 0
## k__Bacteria; p__Fusobacteria 1671 769 184 3
## 700095275
## k__Bacteria; p__Firmicutes 7771
## k__Bacteria; p__Actinobacteria 53
## k__Bacteria; p__Proteobacteria 94
## k__Bacteria; p__Bacteroidetes 76
## k__Bacteria; p__Tenericutes 6
## k__Bacteria; p__Fusobacteria 0
So, look at how the otu_table is set up. Use a function to get the dimensions
For finding the number of phyla, we probably want the number of unique phyla, so that is:
# function that finds the number of unique things ...
## [1] "k__Bacteria; p__Firmicutes" "k__Bacteria; p__Actinobacteria"
## [3] "k__Bacteria; p__Proteobacteria" "k__Bacteria; p__Bacteroidetes"
## [5] "k__Bacteria; p__Tenericutes" "k__Bacteria; p__Fusobacteria"
## [7] "k__Bacteria; p__Verrucomicrobia" "k__Bacteria; p__Cyanobacteria"
## [9] "k__Bacteria; p__Spirochaetes" "k__Bacteria; p__TM7"
## [11] "k__Bacteria; p__Lentisphaerae" "k__Bacteria; p__[Thermi]"
## [13] "k__Bacteria; p__Synergistetes" "k__Bacteria; p__Acidobacteria"
## [15] "k__Bacteria; p__SR1" "k__Archaea; p__Euryarchaeota"
## [17] "k__Bacteria; p__Fibrobacteres" "k__Bacteria; p__Chloroflexi"
## [19] "k__Bacteria; p__GN02" "k__Bacteria; p__Planctomycetes"
filtering out taxa that occur in fewer than 5 inds … look to the earlier walkthrough on how to do this…. but then later it said at the genus level, so then you need to use otu_tax_lev() to get those with genus information
## [1] 158
Normalizing the data
# Define the first column of otu.m as the variable 'taxonomy'
taxonomy = otu.gen.filtered[,1]
# boxCoxNormalize() requires a matrix with only numeric data
otu.n <- as.matrix(otu.gen.filtered[,2:ncol(otu.gen.filtered)])
# apply boxCoxNormalize() using a for loop to the new matrix
for (i in 1:nrow(otu.n)){
otu.n[i,] = boxCoxNormalize(otu.n[i,])
}
## Warning in estimateTransform.default(X, Y, weights, family, start,
## method, : Convergence failure: return code = 52
## Warning in estimateTransform.default(X, Y, weights, family, start,
## method, : Convergence failure: return code = 52
# add 'taxonomy' back as column #1 to otu.n. Why is the as.data.frame function important here?
otu.gen.normalized <- cbind(taxonomy, as.data.frame(otu.n))
matching the metadata to the samples that we have:
# finding the samples in the metadata that match the samples we have
metadata <- metadata[as.character(metadata$SampleID) %in%
colnames(otu.gen.normalized[,2:ncol(otu.gen.normalized)]),]
# finding the order that the rows match the colums
mm <- match(colnames(otu.gen.normalized[, -1]),
as.character(metadata$SampleID))
# reordering the metadata based on matching rows
metadata <- metadata[mm,]
# checking to see if the names are the same
as.character(metadata$SampleID) ==
colnames(otu.gen.normalized[,2:ncol(otu.gen.normalized)])
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE
sex <- metadata$Sex
sex <- factor(sex,c("male","female"))
bs <- metadata$BodySite
bs <- factor(bs,c("Subgingival_plaque","Saliva","Left_Retroauricular_crease","Stool","Mid_vagina"))
ba <- metadata$Description
ba <- factor(ba,c("Urogenital_tract","Oral","Gastrointestinal_tract","Skin"))
pvals.sex <- vector()
pvals.bs <- vector()
pvals.ba <- vector()
Doing all this stuff with for loops
otu.m <- otu.gen.normalized
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ sex)
pvals.sex[i] <- summary(aovOut)[[1]][1,5]
}
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ bs)
pvals.bs[i] <- summary(aovOut)[[1]][1,5]
}
for(i in 1:nrow(otu.m)){
y <- as.numeric(otu.m[i,2:ncol(otu.m)])
aovOut <- aov(y ~ ba)
pvals.ba[i] <- summary(aovOut)[[1]][1,5]
}
pvals.sex = p.adjust(pvals.sex, "fdr")
pvals.bs = p.adjust(pvals.bs, "fdr")
pvals.ba = p.adjust(pvals.ba, "fdr")
How many p-vals are significant
sum(pvals.sex < 0.05)
## [1] 3
sum(pvals.bs < 0.05)
## [1] 146
sum(pvals.ba < 0.05)
## [1] 145
Which OTUs are significant?
otu.m[which(pvals.sex < 0.05),1]
otu.m[which(pvals.bs < 0.05),1]
otu.m[which(pvals.ba < 0.05),1]
… this dumps a ton of output
We can use this code to find out how many taxa differ significantly at the genus level:
sum(pvals.sex < 0.05)
sum(pvals.bs < 0.05)
sum(pvals.ba < 0.05)
So, for my dataset, there are 3 genera that differ based on sex, 146 genera that differ based on body site, and 145 genera that differ based on body area
so, it’s pretty much follow the walkthrough and copy-paste commands – so go forth and have fun.